In this class, we will talk about model explainability, but more in the context of data explainability, or root-cause analysis. In many cases, building a very good machine learning model is not the ultimate goal; what is really wanted is an understanding of the data. A factory wants to know why its product is plagued by a defect, not to predict after the fact whether a defect is present. A football team wants to know which position is best for scoring a goal, not the probability of scoring from a given position. And even when people do want a prediction, they would love to see a justification before trusting the model. Often a good plot is worth more than a sophisticated machine-learning approach.
import pandas as pd
import numpy as np
import dalex as dx
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_wine
data = load_wine()
data
df = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']
df['target'] = y
You should already be familiar with many data visualization techniques, so we will not cover them now. I just want to share a less popular type of data analysis. Plotting the target against a single feature is usually not helpful, but after some modification we might be able to see some patterns.
plt.plot(df.flavanoids, y, 'bo')
For each value of a feature, we can plot the average target over the rows below and above that value. Note that for the ">=" line ("above that value"), the further left we go, the larger the fraction of the data that is covered; symmetrically, the "<=" line covers more data the further right we go.
for col in df.columns.drop('target'):
    tmp = df.sort_values(col)
    plt.title(col)
    plt.plot(tmp[col], tmp[col].apply(lambda x: tmp[tmp[col] <= x].target.mean()), label="<=")
    plt.plot(tmp[col], tmp[col].apply(lambda x: tmp[tmp[col] >= x].target.mean()), label=">=")
    # rolling mean of the target over a window of 20 sorted rows
    plt.plot(tmp[col], np.convolve(np.ones(20)/20, tmp.target, mode='same'), label="~=~")
    plt.legend()
    plt.grid()
    plt.show()
Ok, let's just train a model. We are not interested in top performance right now, so we will skip hyperparameter optimization. Also, since we want to find patterns in the data we already have, we don't split the data into validation and test sets.
model = RandomForestRegressor()
x = df.drop('target', axis=1)
y = df.target
model.fit(x, y)
RandomForestRegressor()
plt.plot(df.target, model.predict(x), 'bo')
Dalex is a Python package for model explainability. We will use some of its functions to understand the data and the model better. First, we need to create an explainer object. Since we are not interested in checking the model performance but in the relation between the data and the target, we use the whole dataset here. If we wanted to assess model performance instead, we would rather use the test set.
exp = dx.Explainer(model, x, y)
Preparation of a new explainer is initiated
  -> data              : 178 rows 13 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 178 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000023729F6CC10> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.934, max = 2.0
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.4, mean = 0.00404, max = 0.3
  -> model_info        : package sklearn
A new explainer has been created!
fi = exp.model_parts()
fi
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 0.073009 | RandomForestRegressor |
| 1 | nonflavanoid_phenols | 0.073834 | RandomForestRegressor |
| 2 | total_phenols | 0.074891 | RandomForestRegressor |
| 3 | proanthocyanins | 0.075508 | RandomForestRegressor |
| 4 | malic_acid | 0.076834 | RandomForestRegressor |
| 5 | magnesium | 0.078246 | RandomForestRegressor |
| 6 | ash | 0.078718 | RandomForestRegressor |
| 7 | alcalinity_of_ash | 0.082857 | RandomForestRegressor |
| 8 | hue | 0.105474 | RandomForestRegressor |
| 9 | alcohol | 0.152315 | RandomForestRegressor |
| 10 | od280/od315_of_diluted_wines | 0.177252 | RandomForestRegressor |
| 11 | color_intensity | 0.313098 | RandomForestRegressor |
| 12 | proline | 0.358969 | RandomForestRegressor |
| 13 | flavanoids | 0.637084 | RandomForestRegressor |
| 14 | _baseline_ | 1.088827 | RandomForestRegressor |
Our first step is feature importance: a basic analysis that measures the global impact of each feature. The idea behind dalex's default approach is to measure how much the model's performance deteriorates after removing a feature. Actually removing it would require retraining the model, and since the optimal hyperparameters might then differ, the results could be affected. To avoid these problems, we do not retrain the model. Instead, we simulate the removal by assigning random values to the feature. To keep those values realistic, they are not completely random: we simply shuffle this column in the dataframe, make predictions, check the performance, and repeat these steps multiple times.
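The shuffle-and-score idea can be sketched by hand. This is only an illustration, assuming RMSE as the loss; dalex uses its own default loss function, and the names `shuffle_importance` and `baseline` below are ours, not part of any library.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)
baseline = np.sqrt(mean_squared_error(y, model.predict(X)))  # loss of the full model

def shuffle_importance(col, n_repeats=5):
    losses = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # shuffling the column simulates "removing" the feature without retraining
        X_perm[col] = rng.permutation(X_perm[col].values)
        losses.append(np.sqrt(mean_squared_error(y, model.predict(X_perm))))
    return np.mean(losses) - baseline  # average loss increase over the full model

scores = {c: shuffle_importance(c) for c in X.columns}
```

The larger the loss increase, the more the model relies on that column, which is exactly the ordering shown in the `dropout_loss` table above.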
fi.plot()
Another useful tool is the partial dependence plot. For a given feature, we observe the average output of the model for different values of that feature: for each considered value, we set it in every row of the dataframe and compute the average prediction.
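The computation behind the plot is simple enough to write out. A minimal sketch, assuming a uniform grid over the observed range (dalex builds its grid differently; `partial_dependence` here is our own illustrative function, not the library's):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor

X, y = load_wine(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, col, grid_size=20):
    grid = np.linspace(X[col].min(), X[col].max(), grid_size)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[col] = value                            # set this value for every row
        averages.append(model.predict(X_mod).mean())  # average prediction over the data
    return grid, np.array(averages)

grid, pd_curve = partial_dependence(model, X, 'flavanoids')
```

For `flavanoids` the curve decreases: low flavanoid values are associated with class 2 and high values with class 0, matching the cumulative-mean plots earlier.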
exp.model_profile().plot()
We can also create similar plots for single rows. Here, for each column we show what the model's output would be if we kept all remaining values fixed and changed only this one selected feature.
exp.predict_profile(x.iloc[[15,80]]).plot()
SHAP values are the equivalent of Shapley values for predictive models. They estimate the effect of a particular value of a particular feature on the prediction for a given row. This is again done by replacing the value with properly sampled alternatives and measuring the effect on the prediction.
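To make the sampling idea concrete, here is a rough Monte Carlo sketch of a Shapley-value estimate for one feature of one row, following the classic scheme (random feature order, remaining values filled from a random background row). This is a simplification of what dalex actually computes, and `shap_one_feature` is our own hypothetical helper:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

def shap_one_feature(model, X, row, feature, n_samples=200):
    cols = np.array(X.columns)
    diffs = []
    for _ in range(n_samples):
        order = rng.permutation(cols)                     # random feature ordering
        background = X.iloc[rng.integers(len(X))]         # random reference row
        pos = int(np.where(order == feature)[0][0])
        with_f, without_f = background.copy(), background.copy()
        for c in order[:pos]:          # features "before" ours take the explained row's values
            with_f[c] = row[c]
            without_f[c] = row[c]
        with_f[feature] = row[feature]  # only the "with" coalition also gets our feature
        pred = model.predict(pd.DataFrame([with_f, without_f]))
        diffs.append(pred[0] - pred[1])
    return float(np.mean(diffs))       # average marginal contribution

contrib = shap_one_feature(model, X, X.iloc[15], 'flavanoids')
```

Row 15 belongs to class 0 and has a high flavanoid value, so this contribution pulls the prediction down, consistent with the breakdown plots below.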
exp.predict_parts(x.iloc[15], type='shap').plot()
exp.predict_parts(x.iloc[15], type='shap').plot()
The result is based on sampling, so it can vary between runs for the same row.
exp.predict_parts(x.iloc[88], type='shap').plot()
exp.predict_parts(x.iloc[88], type='shap').result
| | variable | contribution | variable_name | variable_value | sign | label | B |
|---|---|---|---|---|---|---|---|
| 0 | hue = 1.0 | -0.040225 | hue | 1.00 | -1.0 | RandomForestRegressor | 1 |
| 1 | proanthocyanins = 1.35 | -0.000843 | proanthocyanins | 1.35 | -1.0 | RandomForestRegressor | 1 |
| 2 | total_phenols = 1.95 | 0.009213 | total_phenols | 1.95 | 1.0 | RandomForestRegressor | 1 |
| 3 | od280/od315_of_diluted_wines = 2.75 | -0.045112 | od280/od315_of_diluted_wines | 2.75 | -1.0 | RandomForestRegressor | 1 |
| 4 | proline = 680.0 | 0.162978 | proline | 680.00 | 1.0 | RandomForestRegressor | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 8 | nonflavanoid_phenols = 0.48 | 0.002025 | nonflavanoid_phenols | 0.48 | 1.0 | RandomForestRegressor | 0 |
| 9 | ash = 2.46 | 0.001694 | ash | 2.46 | 1.0 | RandomForestRegressor | 0 |
| 10 | magnesium = 84.0 | 0.001294 | magnesium | 84.00 | 1.0 | RandomForestRegressor | 0 |
| 11 | proanthocyanins = 1.35 | -0.001058 | proanthocyanins | 1.35 | -1.0 | RandomForestRegressor | 0 |
| 12 | alcalinity_of_ash = 21.6 | 0.001025 | alcalinity_of_ash | 21.60 | 1.0 | RandomForestRegressor | 0 |
338 rows × 7 columns
Task: for each class, find the most representative examples and plot a breakdown for them.
exp = dx.Explainer(model, x, y)
fi = exp.model_parts()
drop = fi.result[fi.result['dropout_loss'] > .25]
drop = drop[drop['variable'] != '_baseline_']
drop
| | variable | dropout_loss | label |
|---|---|---|---|
| 11 | color_intensity | 0.313662 | RandomForestRegressor |
| 12 | proline | 0.362350 | RandomForestRegressor |
| 13 | flavanoids | 0.621831 | RandomForestRegressor |
representatives = drop['variable']
representatives
11    color_intensity
12            proline
13         flavanoids
Name: variable, dtype: object
for_plot = df[representatives].copy()  # .copy() avoids a SettingWithCopyWarning on the next line
for_plot['target'] = df['target']
for_plot
| | color_intensity | proline | flavanoids | target |
|---|---|---|---|---|
| 0 | 5.64 | 1065.0 | 3.06 | 0 |
| 1 | 4.38 | 1050.0 | 2.76 | 0 |
| 2 | 5.68 | 1185.0 | 3.24 | 0 |
| 3 | 7.80 | 1480.0 | 3.49 | 0 |
| 4 | 4.32 | 735.0 | 2.69 | 0 |
| ... | ... | ... | ... | ... |
| 173 | 7.70 | 740.0 | 0.61 | 2 |
| 174 | 7.30 | 750.0 | 0.75 | 2 |
| 175 | 10.20 | 835.0 | 0.69 | 2 |
| 176 | 9.30 | 840.0 | 0.68 | 2 |
| 177 | 9.20 | 560.0 | 0.76 | 2 |
178 rows × 4 columns
for var in for_plot.columns.drop('target'):
    plt.rcParams["figure.figsize"] = (5, 5)
    plt.title(var)
    plt.plot(for_plot['target'], for_plot[var], 'bo', label=var)
    plt.xticks(for_plot['target'].unique())
    plt.legend()
    plt.show()
There are other approaches that can be used for model explainability.
Send it to gmiebs@cs.put.poznan.pl within 144 hours after the class is finished. Start the subject of the email with [IR]
Team members:
df = pd.read_csv("DATA.csv")
P.S.: Dataset can be found here: https://www.kaggle.com/datasets/muhammad4hmed/monkeypox-patients-dataset
df
| | Patient_ID | Systemic Illness | Rectal Pain | Sore Throat | Penile Oedema | Oral Lesions | Solitary Lesion | Swollen Tonsils | HIV Infection | Sexually Transmitted Infection | MonkeyPox |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P0 | None | False | True | True | True | False | True | False | False | Negative |
| 1 | P1 | Fever | True | False | True | True | False | False | True | False | Positive |
| 2 | P2 | Fever | False | True | True | False | False | False | True | False | Positive |
| 3 | P3 | None | True | False | False | False | True | True | True | False | Positive |
| 4 | P4 | Swollen Lymph Nodes | True | True | True | False | False | True | True | False | Positive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | P24995 | None | True | True | False | True | True | False | False | True | Positive |
| 24996 | P24996 | Fever | False | True | True | False | True | True | True | True | Positive |
| 24997 | P24997 | None | True | True | False | False | True | True | False | False | Positive |
| 24998 | P24998 | Swollen Lymph Nodes | False | True | False | True | True | True | False | False | Negative |
| 24999 | P24999 | Swollen Lymph Nodes | False | False | True | False | False | True | True | False | Positive |
25000 rows × 11 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Patient_ID                      25000 non-null  object
 1   Systemic Illness                25000 non-null  object
 2   Rectal Pain                     25000 non-null  bool
 3   Sore Throat                     25000 non-null  bool
 4   Penile Oedema                   25000 non-null  bool
 5   Oral Lesions                    25000 non-null  bool
 6   Solitary Lesion                 25000 non-null  bool
 7   Swollen Tonsils                 25000 non-null  bool
 8   HIV Infection                   25000 non-null  bool
 9   Sexually Transmitted Infection  25000 non-null  bool
 10  MonkeyPox                       25000 non-null  object
dtypes: bool(8), object(3)
memory usage: 781.4+ KB
df.isna().sum().sum()  ## no missing values, clean data (isnull is just an alias of isna)
0
df.shape[0] - df.drop_duplicates().shape[0]
0
df = df.drop(columns="Patient_ID")  ## since it doesn't affect the resulting class
df
| | Systemic Illness | Rectal Pain | Sore Throat | Penile Oedema | Oral Lesions | Solitary Lesion | Swollen Tonsils | HIV Infection | Sexually Transmitted Infection | MonkeyPox |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | False | True | True | True | False | True | False | False | Negative |
| 1 | Fever | True | False | True | True | False | False | True | False | Positive |
| 2 | Fever | False | True | True | False | False | False | True | False | Positive |
| 3 | None | True | False | False | False | True | True | True | False | Positive |
| 4 | Swollen Lymph Nodes | True | True | True | False | False | True | True | False | Positive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | None | True | True | False | True | True | False | False | True | Positive |
| 24996 | Fever | False | True | True | False | True | True | True | True | Positive |
| 24997 | None | True | True | False | False | True | True | False | False | Positive |
| 24998 | Swollen Lymph Nodes | False | True | False | True | True | True | False | False | Negative |
| 24999 | Swollen Lymph Nodes | False | False | True | False | False | True | True | False | Positive |
25000 rows × 10 columns
counts = df['MonkeyPox'].value_counts()
# index explicitly, so the fractions match the pie labels ('No' = Negative, 'Yes' = Positive);
# value_counts() sorts by frequency, so relying on its order could swap the labels
data = [counts['Negative'] / counts.sum(), counts['Positive'] / counts.sum()]
f, ax = plt.subplots(figsize=(5, 5))
temp = plt.pie(data, labels=['No', 'Yes'], autopct='%.0f%%', wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' }, colors=['dodgerblue', 'gold'])
ax.set_title('Monkey Pox', fontsize=15, fontweight='bold', pad=0)
fig, axes = plt.subplots(2, 4, figsize=(30, 15))
idx = 1
for row in range(2):
    for col in range(4):
        sns.countplot(ax=axes[row, col], x=df[df.columns[idx]], hue=df['MonkeyPox'],
                      palette=['#35CEF3', '#495EDB'],
                      order=[1, 0]).set(xticklabels=['Yes', 'No'], ylabel=None, xlabel=None)
        axes[row, col].set_title(df.columns[idx], fontsize=30, fontweight='bold', pad=10)
        axes[row, col].tick_params(axis='both', labelsize=25)
        axes[row, col].legend(title="Monkey Pox", fontsize=15, title_fontsize=20,
                              labels=['Positive', 'Negative'])
        idx += 1
sns.set_style("ticks")
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder
binary_enc = OrdinalEncoder()
df = pd.DataFrame(binary_enc.fit_transform(df), columns=df.columns)
df
| | Systemic Illness | Rectal Pain | Sore Throat | Penile Oedema | Oral Lesions | Solitary Lesion | Swollen Tonsils | HIV Infection | Sexually Transmitted Infection | MonkeyPox |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 4 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24995 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 24996 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 24997 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 24998 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 24999 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
25000 rows × 10 columns
X, y = df.drop(columns=["MonkeyPox"]), df["MonkeyPox"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
pipe = Pipeline([('OneHotEncoder', OneHotEncoder()), ('RandomForestClassifier', RandomForestClassifier())])
pipe.fit(X_train, y_train)
Pipeline(steps=[('OneHotEncoder', OneHotEncoder()),
('RandomForestClassifier', RandomForestClassifier())])
pipe.score(X_test, y_test)
0.6744
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0.0 0.60 0.36 0.45 925
1.0 0.70 0.86 0.77 1575
accuracy 0.67 2500
macro avg 0.65 0.61 0.61 2500
weighted avg 0.66 0.67 0.65 2500
exp = dx.Explainer(pipe, X_test, y_test)
Preparation of a new explainer is initiated
  -> data              : 2500 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2500 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000023729F6CD30> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.632, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.0, mean = -0.00213, max = 1.0
  -> model_info        : package sklearn
A new explainer has been created!
fi = exp.model_parts()
fi.plot()
Since most classifiers perform similarly in our case, we decided to experiment with encoders in the pipeline, and eventually OneHotEncoder was chosen.
It does not affect the results much, but the OneHotEncoder-based pipeline takes noticeably longer to train (several encoders as well as classifiers were tested).
Apart from that, OrdinalEncoder turned out to be quite advantageous as a preprocessing step for the subsequent dalex Explainer use, since it makes all columns numeric. Otherwise, the model prediction functions (such as the plot above) could not be used.
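The difference between the two encoders can be seen on a tiny hypothetical frame with the same kinds of columns as the dataset (the `toy` frame below is made up for illustration): OrdinalEncoder keeps one numeric column per feature, while OneHotEncoder expands each category into its own column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

toy = pd.DataFrame({
    'Systemic Illness': ['None', 'Fever', 'Swollen Lymph Nodes'],
    'Rectal Pain': [False, True, False],
})
ordinal = OrdinalEncoder().fit_transform(toy)  # one numeric column per feature
onehot = OneHotEncoder().fit_transform(toy)    # one column per category (sparse by default)
print(ordinal.shape, onehot.shape)             # (3, 2) vs (3, 5): 3 + 2 categories
```

This is why the ordinal-encoded frame keeps the original 9 feature columns, which in turn keeps the dalex feature-importance plot readable.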